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ABSTRACT 


The  empirical  model  builder  utilizing  regression  tech¬ 
niques  frequently  relies  on  the  coefficient  of  determination, 
2 

R  ,  to  measure  'goodness  of  fit*.  Costing  and  pricing  ana¬ 
lysts  using  such  variable  selection  techniques  frequently 

2 

encounter  inflated  R  values.  This  paper  examines  the  space 
within  which  the  regression  model  operates  and  presents  prac¬ 
tical  optimization  algorithms  to  help  assess  the  amount  of 

2 

confidence  that  can  be  placed  in  R  for  a  particular  set  of 

candidate  predictor  variables.  The  algorithms  describe  a 

technique  using  linear  programming  to  find  the  lowest  value 
2 

of  R  possible  using  the  given  set  of  data. 
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I .  INTRODUCTION 


A.  PROBLEM  MOTIVATION 

Parametric  cost  estimation  is  a  management  tool  used  to 
aid  in  the  prediction  of  the  cost  of  a  proposed  system.  It 
involves  predicting  the  cost  (dependent  variable)  of  a  system 
by  means  of  explanatory  (independent)  variables  such  as  sys¬ 
tem  characteristics  or  performance  requirements.  This  proce¬ 
dure  is  based  on  the  premise  that  the  cost  of  a  system  is 
related  in  a  quantifiable  way  to  the  system's  physical  and 
performance  characteristics .  The  expression  of  this  quanti¬ 
fiable  relationship  is  in  the  form  of  an  estimating  equation 
derived  through  statistical  regression  analysis  of  historical 
cost  data  on  systems  which  are,  more-or-less,  analogous  to 
the  proposed  system.  Since  parametric  cost  estimates  can  be 
developed  during  the  concept  formulation  stage  of  the  acqui¬ 
sition  process  before  engineering  plans  are  finalized,  these 
estimates  can  be  used  by  management  to: 

(1)  Identify  possible  cost/performance  tradeoffs  in  the 
design  effort. 

(2)  Provide  a  basis  for  cost/effectiveness  review  of 
performance  specifications . 

(3)  Provide  information  useful  in  the  ranking  of 
competing  alternatives. 

(4)  Suggest  a  need  for  investigating  new  alternatives. 

Cost  overruns  have  been  prevalent  in  the  acquisition 

process  for  new  weapon  systems  making  cost  estimation  a  very 


important  problem  for  all  components  of  the  Department  of 
Defense. 

To  combat  this  problem,  the  Department  of  Defense  has 
issued  directives  to  employ  independent  parametric  cost 
estimation.  Publications  such  as  .Reference  [1]  have  appeared 
which  give  step  by  step  methodology  for  the  development  of 
a  parametric  cost  estimate. 

Regression  problems  faced  by  costing  and  pricing  analysts 
in  these  situations  are  inherently  difficult  for  two  fundamen¬ 
tal  reasons  [Ref.  2] : 

(1)  The  number  of  observations  is  usually  small  compared 
with  the  number  of  system  characteristics  which  are 
candidate  components  of  the  regression  equation. 

(2)  The  available  data  is  not  produced  by  employing  an 
efficient  experimental  design. 

Under  these  circumstances,  it  has  been  shown  that  the  use 

of  variable  selection  techniques  may  result  in  regression 

2 

equations  which  yield  inflated  R  values  whose  statistical 

significance  cannot  be  tested  using  the  F-test.1 

In  general  parametric  cost  estimation,  an  analyst  should 

not  blindly  trust  the  regression  equation  resulting  from  his 

analysis.  To  measure  the  'goodness  of  fit',  the  analyst  can 

2 

use  such  statistics  as  R  and  F.  (As  noted  earlier,  however, 
regression  models  for  cost  estimation  often  do  not  allow  use 
of  F.)  There  are  few  hard  and  fast  rules  for  assessing  the 

1  2 

This  is  the  case  when  R  is  not  significant,  and  some, 
but  not  all,  of  the  0^  are  significant. 


usefulness  of  such  a  model.  This  is  especially  true  of  models 
that  result  from  the  application  of  a  variable  selection  tech¬ 
nique  in  order  to  obtain  a  •best'  prediction  equation.  The 
2  ... 

R  statistic  in  these  situations  may  not  give  a  meaningful 
indication  of  the  model's  applicability. 

The  purpose  of  this  paper  is  to  investigate  the  coeffi- 

.  .  2 

cient  of  determination,  R  ,  used  xn  best  subset  regression 

analysis.  The  solution  algorithms  presented  in  this  paper 

provide  a  practical  method  to  help  assess  the  confidence 

2 

placed  by  the  empirical  model  builder  in  R  for  a  regression 
upon  a  particular  set  of  exogenous  data.  It  may  contribute  to 
the  theoretical  foundation  for  the  understanding  of  regression 
models,  whose  properties  are  not  fully  understood. 

B.  STATEMENT  OF  THE  PROBLEM 

Suppose  the  analyst  selects  n-independent  observations  on 
p-predictor  (candidate)  variables  and  one  dependent  variable. 
The  goal  of  the  analysis  is  to  determine  the  k-variable  re¬ 
gression  equation  which  maximizes  the  coefficient  of  deter¬ 
mination  for  various  values  of  k.  The  difficulty  with  this 

2 

analysis  is  assessing  the  statistical  significance  of  R  for 
a  given  value  of  k. 

The  n-independent  observations,  mathematically,  span  a 
n-dimensional  finite  vector  space  (call  it  En) .  The  regres¬ 
sion  procedure  projects  the  dependent  variable,  Y,  onto  sub¬ 
spaces  within  En  looking  for  the  best  fit  (prediction) .  How 
the  subspaces  are  oriented  in  En  therefore  dictates  the  quality 


of  fit  that  can  be  obtained.  The  subspaces  for  E  are  deter¬ 
mined  by  the  candidate  predictor  variables;  each  predictor 
variable's  observed  values  representing  a  vector  in  En ,  and 
each  combinatorial  set  of  the  p-predictor  variables  spanning 
a  subspace.  Obviously,  there  are  (j^)  possible  k-variable  pre¬ 
diction  equations.  Therefore,  the  analyst's  selection  of  the 
p- variables  to  be  used  as  candidates  determines  which  sub¬ 
spaces  are  available  for  the  regression  procedure  to  consider 

2 

in  best  subset  selection. 

Wallenius  [Ref.  3]  in  trying  to  gather  information  on  the 

unknown  distribution  of  R2  asks,  "How  well  do  the  (^)  subspaces 

spanned  by  all  the  subsets  of  columns  of  X  'fill'  En?"  (Where 

X  is  the  (n  x  p)  matrix  of  n-observations  on  the  system's 

p-characteristics.)  In  other  words,  using  the  candidate  pre- 

2 

dictor  variables  selected  by  the  analyst,  will  the  highest  R 
value  obtainable  through  regression  differ  very  much  from  the 
worst  possible? 

Wallenius  characterized  this  problem  mathematically  by 
defining  the  'coefficient  of  fill'  (COF)  as  follows: 


Wx'k> 


nin  max  K.ix  x  x 

-  11l"“1kJ  X1  2 


This  formula  can  be  interpreted  to  ask,  "If  given  some  set 
of  candidate  predictor  variables,  what  is  the  worst  Y  that 


^The  total  number  of  exogenous  variables  that  Y  is 
regressed  upon  is  p. 


can  be  predicted?"  Where  'worst'  can  e  identified  by  the 

2  2 
lowest  R  value  obtainable.  Thus,  a  lower  bound  for  R  is 

also  obtained  by  answering  this. 

The  problem  is  extremely  difficult  to  solve  directly.  Thi 
paper  proposes  an  algorithm  for  solving  this  problem  using 
optimization  with  a  surrogate  objective  function.  The  number 
of  optimizations  required  to  obtain  a  global  solution  is 
exponentially  related  to  p  (the  number  of  predictor  variables) 
Thus,  in  the  area  of  p  =  14,  enumeration  begins  to  become 
economically  infeasible  as  a  solution  technique,  and  hence, 
an  algorithm  is  proposed  to  search  the  area  about  some  mini¬ 
mum  point.  This  latter  algorithm  cannot  guarantee  a  global 
solution,  so  it  must  be  considered  local  in  nature.  Whether 
such  a  local  solution  is  useful  requires  further  research; 
usefulness  may  be  directly  dependent  upon  the  empirical  nature 
of  the  data. 


SUPPORTING  CONCEPTS 


A.  GENERAL  LINEAR  REGRESSION  USING  LEAST  SQUARES 

A  general  and  frequently  used  linear  model  is  the  'multi 
pie  linear  regression  model*.  It  can  be  represented  as 
follows: 


-  3, 


BlXil  +  B2Xi2  + 


6  x.  + 
n  in 


ei 


The  variable  represented  by  y  is  the  variable  of  interest 
(i.e.,  to  be  predicted).  The  variables  represented  by  the  x^ 
are  associated  with  y  and  may  influence  the  behavior  of  y. 

Thus,  mathematically  y  is  called  the  dependent  variable 
(endogenous)  and  the  x  variables  are  called  independent  varia¬ 
bles  (exogenous) .  Statistically,  this  model  is  referred  to 
as  the  regression  of  y  on  the  x  variables.  The  coefficients 
6  are  referred  to  as  'partial  regression  coefficients'  and 
they  specify  the  linear  functional  relationship  between  the 
independent  variables  and  the  dependent  variable.  Mathe¬ 
matically,  the  8j  are  the  partial  derivatives  of  the  functional 
relationship  3y/3Xj.  Thus,  a  3  ^  indicates  the  change  in  the 
dependent  variable  y  corresponding  to  a  unit  change  in  the 
independent  variable  x^  (all  other  independent  variables  held 
fixed)  . 


There  are  various  criteria  used  in  regression,  however  the 
formulation  of  interest  for  this  paper  is  based  upon  the  least 


squares  criteria.  Use  of  least  squares  to  derive  the  formula 
for  estimating  Y  is  shown  below. 

Using  matrix  notation,  the  regression  model  can  be  written 

as : 

Y  =  X0  +  £  , 

where 


(a)  X  is  the  n  x  (m+1)  matrix  of  n-observations  on 
m-independent  (x)  variables  plus  a  dummy  variable. 

(b)  Y  is  the  column  vector  of  the  n-observed  values  of  Y. 

(c)  S_  is  the  column  vector  of  the  m+1  partial  regression 

A 

coefficients  (whereas,  |3  is  the  vector  of  estimated 
regression  coefficients) . 


14 


Formulating  the  least  squares 


min  £T£  =  ( Y-XB) T ( Y-XS)  , 

taking  the  derivative  with  respect  to  6. 

|j[YTY  -  2YTXB  +  bVxB]  =  0  , 

and  then  solving  for  j3,  we  get  the  estimate 

A  T  -1  T 

B  *  (XAX)  Ax  Y  . 

B.  STANDARDIZATION  AND  NORMALIZATION  OF  VECTORS 

The  regression  coefficients  of  the  linear  model  are  func¬ 
tions  of  the  units  of  measurement  of  the  variables.  The 
magnitudes  of  coefficients  are  influenced  by  choices  of  units 
of  measurement.  Thus,  a  tantamount  scaling  problem  to  that 
experienced  in  linear  programming  exists.  This  scaling  prob¬ 
lem  is  avoided  by  use  of  'standardized  regression  coefficients 
Standardized  regression  coefficients  are  the  end  result 
when  the  variables  which  they  are  estimated  from  have  been 
transformed  to  unit  variance. 

Consistent  with  previous  notation,  and  for  use  further  on 
in  this  paper,  an  alternative  form  to  obtain  'standardized' 
predictions  of  the  dependent  variable  y^  for  all  i  (repre¬ 


sented  as  Y*)  is: 


-f-(Y-yl),  or  Y  *  ~-(Y  -  Y) 

°Y  °Y  ~  ” 


(a)  (*)  denotes  'standardized*. 

(b)  X  denotes  a  (n  x  1)  column  vector  of  ones. 

Of  great  importance  is  that  a  linear  transformation  has 
been  performed,  and  that  the  intercept  of  the  regression 
equation  is  zero  (i.e.,  the  equation  passes  through  the  ori¬ 
gin)  .  Each  partial  regression  coefficient  indicates  how  many 
standard  deviation  changes  in  y  are  associated  with  one  standard 
deviation  change  in  the  corresponding  x  (all  other  x^  held 
fixed) .  Also,  a  mathematical  characteristic  is  that 


l  yi 

i=*l  x 


=  0 


or,  equivalently, 


1Y*  =  0  . 


Mathematically,  normalization  is  a  linear  transformation 
that  takes  any  given  vector  and  converts  its  length  to  unit 
length  (length  in  multidimensional  vector  spaces  discussed 
in  the  next  section) .  A  vector  of  unit  length  is  said  to  have 
a  norm  *  1.  Any  arbitrary  vector  can  be  transformed  into  a 
unit  vector  by  dividing  it  by  its  norm. 

A  vector  with  unit  length  can  be  depicted  as  follows: 


l  y.  »  1  .  or  YTY 


C.  LENGTH,  ANGLE,  AND  COSINE  FUNCTION  IN  MULTIDIMENSIONAL 

VECTOR  SPACES 

Using  the  concept  of  inner  products,  length  or  magnitude 
of  a  vector  can  be  defined.  In  this  context,  the  length  is 
referred  to  as  the  'norm'  of  the  vector. 

The  norm  of  the  arbitrary  vector  (x^,x2, . . . ,xn)  in  Rn 
is  denoted  by 

|  I  (x1,x2,...,xn)  I  I  =  /(x1,x2, . . .  ,xjj(x1,x2, . . .  ,xn)  . 

In  matrix  notation, 


|  |X|  |  *  (XTX)1/2  . 


Thus,  a  normalized  vector  would  be  created  as  follows: 

-  1  X 

5  ’  -  ‘  TTUT  "  (xTwV2  ' 

To  obtain  the  cosine  of  the  angle  between  two  vectors, 
one  normalizes  each  vector  and  then  takes  their  inner  product 
If  u  and  v  are  vectors  in  Rn,  the  cosine  of  the  angle  can 
be  defined  as: 


cos  9 


V 


Two  vectors  are  orthogonal  if  and  only  if  their  inner  product 
is  zero.  This  means  the  angle  between  them  is  90°,  cos  9*0 


and  vu  *  0. 


D.  PROJECTIONS 


Let  u  and  v  be  vectors  with  angle  a  between  them.  The 
scalar  projection  (or  component)  of  v  in  the  direction  of  u 
is  defined  to  be  | |v| Jcos  a.  Geometrically,  it  can  be  visu¬ 
alized  as  in  figure  one . ^ 


Figure  1.  Scalar  Projection  of  V  onto  U 

Alternatively,  computation  is  easier  if  written  as: 

scalar  projection  of  v  onto  u  *  v  •  . 

This  is  the  inner  product  (dot  product  in  En)  of  v  with  the 
unit  vector  in  the  direction  of  u. 


^Note  that  the  dashed  line  is  perpendicular  to  u. 


The  vector  projection  is  the  scalar  projection  times  the 
mixt  vector  in  the  direction  of  u.  So#  the  vector  projec— 
tion  of  v  onto  u  is  written: 


projuv 


TTSTT 


i 


j  v | | cos  a  * 


(v  • 


_u 

u 


) 


_u 
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PROBLEM  FORMULATION  ONE 


Recalling  from  Chapter  II  the  expression  for  estimating 
the  regression  coefficients, 

A  T  -1  T 

8  =  (XAX)  ax  , 


there  is  a  unique  solution  for  6_  if  the  matrix  X  X  is  non¬ 
singular,  that  is,  if  it  is  full  rank.  This  is  the  case  if 
X  has  n- independent  columns,  since  the  columns  of  X  X  and 
the  rows  of  X  span  the  same  space  [Ref.  4]. 

Thus,  given  -  (n  x  p)  matrix  X  of  full  rank  and  n  <  p,  a 
finite  dimensional  vector  space  En  can  be  defined  where  n- 
linearly  independent  column  vectors  span  the  space  (i.e.,  fo: 
a  basis) .  Extending  linear  algebra  concepts  to  the  linear 
regression  model  of  Section  II. A,  a  geometric  interpretation 
is  that  k  columns  (where  k  n)  of  the  X  matrix  span  a  k- 
dimensional  subspace  in  the  n-dimensional  space  En.  Further 
the  least  squares  procedure,  through  'best*  subset  selection 
will  select  amongst  the  (^)  columns  of  X  a  k-dimensional  sub¬ 
space  of  En  to  predict  the  vector  of  dependent  variables  (Y) 
such  that  cos  9  *  R  is  maximized  (9  is  minimized) ;  where  9 


is  the  angle  between  Y  and  its  orthogonal  projection  onto  a 
candidate  subspace. 


A.  RESTATEMENT  OF  'COEFFICIENT  OF  FILL'  (COF) 


Consider  the  matrix  X.  Assume  it  has  rank  *  n.  Next, 
require  that  Y ,  the  vector  of  dependent  variables  be  obtained 
through  standardized  regression  coefficients  (see  Section  II. B) 
that  is , 


Y*  =  ~(Y-Y)  . 

°Y 

This  requirement  causes  no  loss  in  generality  since, 

2  _  2 
x  * ;  ( x^  ,  .  •  •  Xj^)  ”Y  #  ( X^  ,  •  •  .  Xy. ) 

Further,  by  requiring  Y*  to  be  a  unit  vector  (normalized) ,  a 
unique  y  is  specified  (i.e.,  all  other  vectors  in  the  direction 
of  Y  will  be  a  scalar  multiple  of  the  normalized  vector) ;  that 
is 


Y*Ty*  *  1  . 

The  'coefficient  of  fill'  (defined  in  Section  I.B)  now  becomes 


’Wx'k> 


Min  Max 

1TY  =  0  { ij^ , . . .  ik } 

y *Ty*»l 


[Ref.  3]: 


B.  TRANSFORMATION  OF  THE  COF  TO  A  QUADRATIC  FORM 

2 

Wallenius  [Ref.  3]  showed  that  the  R  portion  of  the  COF 
can  be  transformed  into  a  quadratic  form.  Such  a  transforma¬ 
tion  can  be  done  as  follows. 

Identify  arbitrarily  any  combination  of  the  (^)  columns 
+-v» 

of  X.  Let  the  i  such  combination  be  called  D^.  Recall  that 

the  columns  of  define  a  subspace  in  En  and  any  vector  in  En 

can  be  projected  onto  that  subspace.  Assume  that  has  rank  k 


Figure  2.  Subspace  Spanned  by  D 

From  linear  algebra  (vector  projections)  and  figure  two,  it 
follows  that 

Z.  -  proj  Y*  *  l  (Y*-X  )X 
1  ai  i=l  1  1 

A  /K 

where  X.  ,...,x.  is  an  orthonormal  basis  for  D.  . 


From  least  squares  regression,  an  alternate  calculation 


for  is  as  follows : 


T  _  1  ni  * 

-i  *  Di(DiDi)  iD^Y 


Recalling  that 


a 


T  -IT* 
(DTD.)  iD.Y 
11  1- 


i 


and  that  the  'generalized  inverse'  of  a  matrix  is 


# 


the  calculation  of  Z i  can  be  stated  as: 

A  A  ♦ 

— i  “  Di^i  where  8_i  ■  D^Y  . 


In  Section  II. C,  the  cosine  of  the  angle  between  two  vec 
tors  was  defined.  To  calculate  the  angle  between  Y*  and  its 
orthogonal  projection  Z^,  use 


Through  substitution  and  simplification,  this  can  be  written 


cos  9^ 


Recalling  from  Section  II. B  that  Y*-Y*  =  1,  and  squaring  both 
2 

sides,  cos  9^  can  be  written  as  follows: 


2n 

cos  9i 


^  iftn  ^  .  it 

IlDjSjr  *  td!d;oidjy 

i  i ..  *  i  i  2 


-  i  lll- 

"I 


Thus, 


4*,^  -  cos20i  • 


Matrix  algebra  then  allows  the  following  conversion  for 


V.-Di5 


T 

D.D.D. 

Ill 


—  rn  n~n  1^  =  d? 


[(DiDT)iDi]A  =  [DiDiDi] 


*Y*;D. 


*T  -T  T  * 
Y  iD^iD^Y 


Y*(DiD”)Y* 


Noting  that  D^T  is  symmetric,  and  letting 


B 


i 


DiDi 
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we  can  write 


Thus ,  the  COF  aan  be  re-expressed  as : 


min  max  Y  TB .Y*  (COFQ) . 

Y*  i 

C.  A  GAME  INTERPRETATION  OF  THE  COFQ 

The  COFQ  can  be  viewed  as  a  game  between  a  person  and 

nature.  The  person  tries  to  choose  the  matrix  that  will 
2 

maximize  R  ,  and  nature  plays  the  part  of  an  antagonist  who 
wants  to  create  the  least  favorable  Y  to  be  predicted  by 
that  B^ . 

The  game  is  visualized  as  depicted  in  figure  three.  For 
a  fixed  Y,  the  regression  'black  box'  (labled  '1')  determines 

A 

the  0_  whose  coefficients  express  Y  as  a  linear  combination 
of  the  'best'  subset  of  independent  variables  (columns  of 
matrix  X) ,  D^. 


2 
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Figure  3.  Game  Interpretation  of  COFQ 


Nature  upon  seeing  the  subspace  represented  by  the  matrix 
B^,  antagonistically  chooses  the  worst  Y  in  En  so  as  to  thwart 
the  regression  model's  validity.  This  is  represented  in  figure 
three  as  the  box  labled  '2'. 

At  first  glance,  this  view  of  the  problem  hints  that  there 

exists  an  iterative  solution  to  the  COFQ.  Movement  towards 

such  a  solution  might  be  measured  by  the  generation  of  a  se- 

2 

quence  of  successively  lower  values  of  R  for  the  COFQ.  How- 
eve  obtaining  an  optimal  Y  is  dependent  on  whether  the  sequence 
generated  is  convergent.  Using  Zangwill's  general  convergence 
theorem,  it  can  be  shown  that  this  process  will  not  generate 
a  convergent  sequence  [Ref.  5] .  For  every  selected,  nature 
will  find  a  vector  Y  orthogonal  to  that  corresponding  sub¬ 
space.  It  is  the  case  that  in  a  finite  dimensional  vector 
space,  for  any  subspace  with  rank  less  than  the  vector  space 
itself,  there  exists  a  vector  orthogonal  to  that  subspace. 

This  is  shown  in  figure  4  for  3-dimensions,  where  the  subspace 
formed  by  and  x 2  is  of  dimension  2.  Vector  Y  is  normal  to 
the  x^,X£,  plane. 

D.  FORMULATION  AS  A  NON-LINEAR  PROGRAMMING  PROBLEM  (NLP) 

FOR  OPTIMIZATION 

In  Section  III.B,  it  was  shown  that  the  COF  has  an  equiva¬ 
lent  representation  using  a  quadratic  form  for  the  objective 
function.  The  matrix  B^  in  the  COFQ  form  is  a  square  symmetri¬ 
cal  matrix  of  size  (n  x  n) ,  and  has  rank  k.  In  general,  a 
quadratic  function  F(x)  has  the  form 
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for  a  constant  matrix  G,  column  vecto-  C,  and  scalar  a 
(multiplication  by  1/2  is  included  in  the  quadratic  term  to 
avoid  the  appearance  of  a  factor  of  two  in  the  derivatives) . 

The  quantity  'G'  is  referred  to  as  the  Hessian  matrix  of  F, 
which  is  the  matrix  of  second  partial  derivatives.  Thus,  Bi 
can  be  characterized  as  the  Hessian  matrix  of  a  quadratic  form. 
1.  Modeling  of  the  COFQ  in  Stages 

The  solution  procedure  for  the  COFQ  can  be  broken  down 
into  several  stages. 

(a)  For  a  particular  ,  optimize  the  quadratic  form 

* 

to  find  the  minimum  Y  . 

(b)  Use  an  algorithm  for  optimal  subset  selection  to 
look  at  some  regions  only  and  thereby  avoid  optimizing  over 
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all  (^)  possible  B^ .  This  is  based  upon  the  reasoning  that 

some  (along  with  the  associated  constraints  to  be  specified 

in  the  next  section)  will  define  convex  regions  which  are 

larger  than  others  and  that  the  larger  regions  will  produce 

a  Y  which  is  'worse'  than  smaller  regions  will.  Amongst  those 

* 

regions  over  which  optimization  was  done,  select  the  Y  corres- 

2 

ponding  to  the  smallest  R  value  as  the  global  minimum.  These 
steps  are  addressed  in  the  remainder  of  this  chapter. 

2 .  Modeling  as  a  NLP 

This  section  addresses  stage  (a) ;  optimizing  with  a 
particular  Bi  to  find  the  minimum  Y*.  Each  optimal  Y*  corres¬ 
ponding  to  a  B.  is  a  local  optimum  for  the  overall  COFQ  problem. 

Modeling  this  stage  as  a  NLP  (non-linear  programming) 
problem,  the  objective  function  becomes: 

*<T«  * 

min  Y  B.y_  . 

From  Section  III. A,  the  constraints  require  Y  to  be  of  unit 

length  and  that  Y  be  obtained  through  use  of  standardized 

regression  coefficients.  Assume  the  latter  is  met,  hence  the 
* 

use  of  Y  in  lieu  of  Y.  The  unit  length  requirement  was 
stated  as 


*fT<  * 

Y  Y  *  1  , 

and  the  latter  constraint  was  represented  by 
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*  0  . 


The  NLP  now  becomes: 


*T  * 

min  Y  B^Y 


*rr  * 

S.t.  Y  Y  =  1 


T  * 

1  Y  =  0 


i  free;  i  =  1, . . . ,n 


Note  that  the  are  free  variables ,  otherwise  the  standardi¬ 
zation  requirement, 


n 


could  not  be  met. 


The  characterization  of  this  NLP  is  that  of  a  NLP  with 

a  non-linear  equality  constraint,  referred  to  in  the  literature 

as  a  NEP  [Ref.  6].  The  constraint  is  a  quadratic  form  for 

which  the  Hessian  is  the  identity  matrix.  In  searching  for 

* 

the  locally  optimal  Y  ,  the  optimization  search  must  move 
along  a  quadratic  surface  at  unit  length  from  the  origin  of  a 
n-dimensional  hypersphere.  (Recall  from  Section  II. B  that  the 
intercept  of  the  regression  equation  is  zero.)  This  may  be 
visualized  in  3-dimensions  as  depicted  in  figure  5.  This 


f 


Figure  5.  Movement  of  Y  along  Quadratic  Surface 

constraint  requirement  to  move  along  a  quadratic  surface 
creates  a  non- convex  NLP/  which  correspondingly  increases  the 
optimization  complexity.  Search  direction  methods  for  uncon¬ 
strained  nonlinear  optimization  are  nonapplicable.  Further, 
reduced-gradient  and  gradient-projection  methods  developed  for 
problems  with  nonlinear  constraints  will  probably  fail  due 
to  the  non- convexity . 

The  requirement  to  solve  many  NLPs  as  the  are  changed 
necessitates  an  optimization  procedure  that  is  general  in  nature 


IV.  PROBLEM  FORMULATION  TWO 

The  inherent  difficulty  with  formulation  one  prompted  a 
search  for  a  computationally  simpler  model.  By  investigating 
the  geometrical  representation  of  the  COF,  could  a  linear 
model  be  built  that,  when  solved,  would  solve  the  original 
problem?  If  an  exact  solution  to  the  original  problem  was 
not  reached,  could  an  approximation  be  found  that  meets  the 
'practical  need'  for  the  COF  value  as  expressed  in  the  Intro¬ 
duction  of  this  paper? 

Such  a  linear  model  could  utilize  the  extensively  known 
results  in  linear  programming;  especially  capitalizing  on 
the  speed  of  computation  using  existing  linear  programming 
algorithms . 

A  similar  methodology  as  was  used  in  Section  III.D  to 
break  the  COFQ  problem  into  stages,  will  be  used  in  the  solu¬ 
tion  approach  to  the  COF.  One  stage  is  optimizing  to  find  the 
* 

minimum  Y  s  locally  through  the  use  of  a  surrogate  objective 
function.  The  second  stage  is  either  determining  the  global 
minimum  Y*  through  enumeration,  or  selecting  the  local  Y* 
most  attractive  as  the  answer  (thereby  approximately  solving 
the  COF) .4 


^Enumeration  seems  economically  feasible  into  the  neighbor¬ 
hood  of  approximately  14  candidate  predictor  variables. 
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A.  GEOMETRICAL  INTERPRETATION 


T 

Let  3i  s 

vectors  X  that  satisfy 


be  a  non- zero  vector. 


Consider  the 


aTX  =  d 

for  some  scalar  d.  The  set  of  X  that  satisfies  this  is  de¬ 
fined  to  be  a  hyperplane.  The  vector  a  is  termed  the  normal 
to  the  hyperplane,  and  the  normalized  vector 

a 


which  has  Euclidean  length  unity,  is  said  to  be  the  'unit 
normal1  to  the  hyperplane. 

One  can  think  of  a  hyperplane  as  a  shift  from  the  origin 
of  the  (n-1) -dimensional  subspace  orthogonal  to  a  [Ref-  7], 
This  can  be  seen  in  figure  6 .  Note  that  if  d  *  0 ,  the 
hyperplane  (subspace)  passes  through  the  origin.  This  can 
be  seen  for  three  dimensions  in  figure  7. 

Recall  from  Section  I.B  that  the  column  vectors  of  the 
matrix  X  (without  dummy  variable)  used  in  linear  regression 
define  a  n-dimensional  finite  vector  space  called  En.  Fur¬ 
ther  recall  that  combinatorial  sets  of  k  such  vectors  span 
subspaces  of  En.  If  the  assumption  of  linearity  is  valid  for 
the  regression  model,  then  interpretation  of  such  subspaces 
as  being  linear  is  valid.  Thus,  let  each  normalized  column 


1 


vector  of  X  represent  the  'unit  normal'  to  a  (n-1) -dimensional 
hyperplane.  Next,  view  each  such  hyperplane  as  a  constraint 
of  the  form^ 

T 

-  d  ;  x  = 

where  i  denotes  the  i***1  column  of  the  data  matrix  of  n-obser- 

T 

vations  on  p-predictor  variables,  and  X  =  (x^,Xj , . . . ,xn) . 

Thus,  the  general  equation  for  the  i^1  constraint  can  be 
written  as  follows: 


aljxl  +  a2  jx2  + 


+  a.(x.  •  d 

nj  n 


where 

d  =  0  for  i  =  1, . . . ,p  . 

Collectively,  these  p  constraints  (or  hyperplanes)  inter¬ 
sect  within  the  hyperspace  and  form  convex  polytopes?  specif¬ 
ically,  cones.  An  elementary  example  of  this  is  shown  in 
figure  8.  Any  vector  in  En  originating  at  the  origin  (to 
include  Y,  the  vector  of  interest)  will  geometrically  lie 
within  some  such  cone  as  defined  by  a  set  of  hyperplanes. 


^A  notation  switch  has  taken  place  to  enable  an  easy 
transition  to  commonly  used  linear  programming  notation. 

This  X  does  not  represent  the  predictor  variables,  but  rather 
real  numbers  to  be  determined. 
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Figure  8.  Convex  Regions  Formed  m  2-Dimensions 


B .  ALGEBRAIC  MODEL 

Consistent  with  linear  programming  notation,  the  p- 
constraints  defined  by  their  unit  normals  can  be  written  as 
follows: 


allxl  +  a12x2  +  •  •  •  *  au,*„ 


a21xl  +  a22x2  +  •••  +  a2nxn 


aplxl  +  ap2x2  +  • • •  +  ap„x„ 


This  system  of  constraints  can  be  put  into  the  standard 
form  AX  *  b ,  where  b  *  0 .  Matrix  A  is  the  matrix  of 


-  I. 


coefficients  and  is  actually  the  transpose  of  the  normalized 
column  vectors  from  the  independent  (predictor)  variable  data 
matrix. 

Using  the  concepts  of  halfspaces,  specifying  _<  or  ^  in 
each  relationship  above  determines  on  which  side  of  each  con¬ 
straint  a  point  in  En  lies.  Putting  the  system  of  constraints 
into  standard  form  will  then  require  a  slack  (surplus)  varia¬ 
ble  for  each  constraint. 

C.  CHARACTERIZATION  OF  THE  CONES 

Since  the  regression  model  projects  the  dependent  varia¬ 
ble  Y  onto  the  subspaces  (hyperplanes) ,  we  want  to  know  for 
any  point  (vector)  Y  which  hyperplanes  are  the  closest  to  Y. 
This  can  be  rephrased  as  asking,  "Within  which  cone  (convex 
polytope)  does  Y  lie?" 

Consider  some  arbitrary  point  (i.e.,  vector  originating 
T 

from  origin)  Xq  =  (x^,x2, —  ,xn)  in  the  hypersphere.  Further, 
assume  that  the  points  are  constrained  in  the  distance  they 
can  be  from  the  origin,  so  as  to  not  have  an  infinite  ray. 

Now,  scanning  out  from  Xq,  it  can  be  seen  that  some  hyper¬ 
planes  are  'closer'  to  Xq  than  are  others  ('closest'  defined 
by  the  smallest  angle  0^  between  So  and  any  vector  in  a 
particular  hyperplane) .  Define  the  set  of  hyperplanes  which 
are  closest  to  Xq  when  considering  all  directions  as  'bounding 
hyperplanes ' ,  and  the  region  bounded  as  the  cone  within  which 
Xq  lies;  refer  to  such  a  cone  as  'hole  H^'.  The  specific 
sizes  and  quantity  of  holes  created  by  the  intersecting 
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hyperplanes  depends  on  the  values  given  in  the  problem  data. 
As  we  scan  out  from  Xg,  we  know  for  certain  that  the  closest 
hyperplane  is  one  of  the  walls.  Note,  however,  that  we  can¬ 
not  extend  this  argument  to  identify  the  other  walls  of  the 
hole  in  which  Xg  lies.  Thus,  the  question  remains  in  identi¬ 
fying  the  walls  of  the  hole  (bounding  hyperplanes) .  The 
concepts  for  reduction  of  linear  inequalities  presents  an 
answer  to  this  question  [Ref.  7]. 

1.  Reduction  of  Inequalities 

Suppose  for  any  arbitrary  point  X  in  En  (subject  to 
X  <_  upper  bound)  all  given  hyperplanes  are  examined.  Then 
there  exist  hyperplanes  'exterior'  to  the  convex  region 
defining  H^.  These  exterior  hyperplanes,  when  viewed  as 
constraints  are  non-binding.  As  an  elementary  example  of 
this,  a  hole  H.  is  depicted  in  figure  9  corresponding  to  the 

<JL 

following  system  of  constraints. 

<_  0  hyperplane  p^ 

>_  0  hyperplane  p2 

>_  0  hyperplane  p3 

<  X  <  U.B . 


T 

aJ_X 


T 

& 


T 

a*X 


L.B. 


As  can  be  seen  from  figure  9,  the  third  constraint  is 
exterior  to  H.  (the  hole  identified  by  the  letter  i)  and  can 


Figure  9.  2-Dimensional  Depiction  of  Non-Binding  p 

be  termed  'redundant'.  This  redundant  constraint  can  be 
eliminated  without  changing  the  solution  set. 

Now,  suppose  a  convex  region  is  specified  by  adding 
slack  variables,  and  putting  the  system  of  equations  in 
standard  form.  Thus  (assuming  artificial  variables  have 
been  forced  out) ,  we  get: 

a?X  +  s.  =0 


In  standard  form,  this  constraint  redundancy  is  reflected  in 


that  s^  is  a  'non-extremal*  variable  and  hence  it,  together 
with  the  third  constraint,  can  be  eliminated.  This  special 
example  generalizes  to  higher  dimensional  problems.  In 
general,  redundant  inequalities  show  up  as  having  non-extremal 
slack  variables. 

Here  the  L.B.  is  any  arbitrary  positive  real  number 
sufficiently  big  to  avoid  roundoff  problems,  yet  to  prevent 
convergence  at  the  origin.  The  following  algorithm,  using 
simplex  techniques,  will  identify  the  binding  constraints 
(walls)  for  any  hole. 

2 .  Algorithm:  Removal  of  Redundant  Constraints 

Assume  that  some  combination  of  'less  than  or  equal 
to'  and  'greater  than  or  equal  to'  inequality  signs  for  the 
p  constraints  are  specified.  Call  this  system  of  p  inequali¬ 
ties  the  'resource  constraints'  and  put  it  in  standard  form. 

a.  Find  a  feasible  solution:  min  f  A.,  where  A. 

i=l  1  1 

is  the  artificial  variable  corresponding  to  the 

. th  .  .  6 

1  constraint. 

b.  Let  S  *  {s1,s2, . . . ,sp};  the  set  of  slack  varia¬ 
bles  . 

c.  Select  from  S. 

^Failure  to  find  a  feasible  solution  for  that  combination 
of  (>>£)  signs  implies  that  a  convex  region  is  not  defined, 
and  therefore  that  combination  can  be  ignored. 


d.  Min  s^,  subject  to  resource  constraints  and 
variable  bounds. 

e.  If  s^  is  non-extremal  (s^  >  0) ,  place  index  i 
in  set  R. 

f.  If  set  S  has  been  exhaustively  examined,  go  to 
Step  g;  otherwise  increment  i  and  go  to  Step  c. 

g.  Remove  or  'fix'  all  constraints  i  s.t.  i  €  {R}. 

The  result  of  this  is  that  the  boundaries  of  the 

convex  region  within  which  a  specified  point  (vector)  lies 
have  been  identified.  Any  optimization  need  only  consider 
this  subset  of  the  resource  constraints. 

D.  OPTIMIZATION  FOR  LOCAL  MINIMA 

2 

The  COF  seeks  a  Y  such  that  R  is  minimized.  For  a  given 

Yg  and  the  corresponding  cone  within  which  Y^  lies,  this 

minimum  is  approached  as  Y  moves  away  from  the  nearest  hyper- 

2 

plane.  Recall  that  as  the  angle  of  projection  increases,  R 

decreases.  However,  past  a  certain  position,  the  angle  of 

projection  between  Y  and  a  different  hyperplane  will  decrease 

enough  such  that  the  regression  will  select  this  second  hyper- 

2 

plane  to  project  Y  onto  instead,  since  a  higher  R  value  can 
be  obtained. 

If  this  process  were  to  occur  between  Y  and  each  of  the 
walls  of  the  hole  simultaneously,  an  'equilibrium  point'  for 
Y  would  be  reached.  This  equilibrium  point  is  defined  to  be 
the  center  of  the  hole.  This  can  be  visualized  in  3-dimensions 
in  figure  10. 


Figure  10 .  Vector  Y  at  Equilibrium  Point 

It  is  at  this  equilibrium  point  where  if  Y  were  projected 
onto  any  of  the  walls,  that  the  angles  of  projection  would  be 
the  'worst  *  ?  we  are  maximizing  the  minimum  angle.  The  Y 
representing  this  point,  and  normalized  to  be  unique,  is  the 
worst  Y  to  be  predicted  using  regression  for  this  region  of 
En. 

The  search  for  a  local  minimum  Y  thus  becomes  a  search 
for  a  vector  that  originates  at  the  origin  and  passes  through 
the  center  of  the  hole.  Considering  the  hole  in  the  context 
of  a  convex  region,  we  are  searching  not  for  a  solution  at  an 
extreme  point,  but  rather  at  the  center  of  the  region. 


Viewing  the  previous  figure  from  'topside',  we  see  Y  as 
a  point  in  the  center  of  a  (n-1) -dimensional  convex  region. 


Figure  11.  Top  View  of  2-Dimensional  Convex  Region 

Such  an  equilibrium  point  can  be  approximated  by  optimizing 
using  slack  associated  with  the  extremal  constraints  repre¬ 
senting  the  hole.  The  objective  function  for  optimization 
can  be  formulated  as 

max  [min  C^X]  , 

X  i 

where  is  included  for  all  i  4  {R}.  Note  each  contains 
only  one  non-zero  coefficient — that  coefficient  representing 
the  slack  variable  for  the  ifc^  resource  constraint.  The  final 
X  obtained  is  the  optimal  value  of  Y. 

T 

The  original  form  of  the  COF  required  Y  Y  ■  1  (i.e.,  unit 
length) ,  and  that  for  the  COFQ  this  requirement  created  a 
non-convex  NLP.  In  figure  12  for  3-dimensions,  it  can  be  seen 
that  there  exists  a  hyperplane  'tangent'  to  this  quadratic 


2 


Figure  12.  Hyperplane  Touching  Surface  at  One  Point 

surface  such  that  the  closest  point  on  the  hyperplane  to  the 

*  * 

origin  is  the  optimal  Y  .  Then  the  vector  Y  is  the  orthogonal 
unit  normal  to  the  hyperplane. 

A  /\ 

Now,  let  the  initial  value  of  Y  (Y  is  normalized)  specify 
a  bounding  hyperplane  to  the  hole.  For  each  new  value  of  Y 
a  new  bounding  hyperplane  can  be  specified,  thereby  linearly 
approximating  the  quadratic  surface  in  successive  increments 
as  Y  moves  towards  the  center  of  the  hole  (optimal  Y) . 

Using  matrix  A  with  the  understanding  that  it  contains  only 
the  extremal  resource  constraints,  the  optimization  problem 
can  be  formulated  as 

T 

max  [min  C.X] 

X  i 

s.t.  AX  +  IS .  =  0 

—  —A 

akTX  <  1 
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X  free 


v  th 

where  a  »  values  of  X  for  the  (k-1)  iteration,  and  Y  is 

represented  by  the  current  value  of  X.  For  each  hole,  the 

A 

initial  Y  is  the  normalized  value  of  the  BFS  to  the  L.P. 
which  was  reached  using  the  algorithm  in  Section  C.2.  The 
values  of  Y  for  the  (k-1)  iteration  become  the  coefficients 
of  ak  (kth  iteration) .  Note  that  Y  is  normalized  between  each 
iteration.  The  quantity  'I'  is  an  identity  matrix  and  SA  is 
the  column  vector  of  slack  variables  corresponding  to  the 
extremal  resource  constraints  (hence  the  use  of  subscript  A) . 
This  problem  can  be  rewritten  in  final  form  as  the  L.P.: 


max  Z 


s.t.  AX  +  IS-  -  0 

—  —A 


Z 


X 


< 


1 


i  i  (R) 


>  0 

free. 


E.  OPTIMIZATION  FOR  A  GLOBAL  MINIMUM 

Recall  from  Section  IV. B  that  the  direction  of  the  inequality 
signs  for  the  resource  constraints  determines  which  convex 


set  is  the  feasible  region.  We  can  see  that  there  are  2P 
such  sets  to  consider  for  local  optimization  in  the  search 
for  a  global  minimum  to  the  COF. 


When  the  number  of  candidate  predictor  variables  is  small, 
say  up  to  about  p  =  14,  enumeration  is  economically  practical. 
For  problems  with  p  larger  than  this,  only  holes  which  look 
larger  than  most  of  the  other  holes  could  be  optimized  as 
'candidates'  for  the  global  minimum.  Such  a  selective  opti¬ 
mization  would  require  some  type  of  global  information  prior 
to  any  optimization.  A  procedure  to  find  such  information 
was  not  found.  This  impasse  led  to  the  algorithm  presented 
below.  This  algorithm,  when  given  a  local  optimum,  searches 
adjacent  holes  for  a  better  minimum.  If  none  better  is  found, 
it  stops;  otherwise,  it  uses  the  best  hole  and  continues  to 
search  from  there.  In  theory  it  might  search  all  2P  holes, 
but  this  is  doubtful  for  real  world  problems  (also,  an  itera¬ 
tion  stop  could  be  put  in  if  desired) .  Note  that  the  solution 
is  still  local  in  nature,  although  it  may  be  the  global  solu¬ 
tion.  A  heuristic  approach  may  have  to  be  developed  to  decide 
which  hole  to  use  to  begin  such  a  search. 

1.  Algorithm:  Searching  the  Neighborhood  about  Y  Min 
An  algorithm  to  search  the  neighborhood  about  a  local 
minimum  in  an  effort  to  find  the  global  solution,  and  if  not, 
to  find  a  better  minimum  (larger  Z  value) ,  is  as  follows: 

a.  Select  some  combination  of  inequality  signs  (£,>J 
for  the  resource  constraints. 


b.  Get  a  BFS  (basic  feasible  solution) ,  remove 
redundant  constraints,  and  then  maximize  Z 
(Z  defined  in  previous  section) . 

c.  Record  extremal  constraints  as  set  E  (along  with 
the  direction  of  the  inequalities) ,  and  record 
the  value  of  X*  and  Z. 

d.  For  the  constraint  of  E,  reverse  the  inequality 
(multiply  the  constraint  by  -1) .  Let  J  =  (k 
constraint,  reversed}  u  {all  resource  constraints  - 
k^}.  Optimize  over  J  as  in  step  b.  If  Z^  >  Z, 
record  as  in  Step  c  (use  label  other  than  E) , 

and  set  'NEXT'  *  k. 

e.  If  all  elements  of  set  E  have  been  exhaustively 
examined,  go  to  Step  f;  otherwise,  increment  k 
and  go  to  Step  d. 

f.  If  no  Z  for  set  E  is  better  than  the  initial  Z, 
STOP.  Otherwise,  let  J  *  {constraint  'NEXT'} 

u  {all  resource  constraints  -'NEXT',  reversed}. 

Fix  'NEXT'  so  as  to  not  reverse  its  inequality 
sign  again.  Go  to  Step  b. 

This  results  in  a  minimum  that  although  is  local, 
can  be  considered  the  best  solution  possible  in  that  region 
of  the  hypersphere. 

Using  the  algorithms  presented  and  the  solution  ob- 

2 

tained,  the  analyst  can  compute  the  corresponding  R  value 
by  forcing  the  regression  or  using  the  equations  in  Section 
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Ill .B.  Thus,  he  now  has  a  relative  measure  of  his  model's 

2 

applicability  through  comparison  of  the  R  value  pertaining 

to  his  prediction  equation  (obtained  through  least  squares 

2 

regression) ,  to  the  lowest  R  obtainable  for  that  set  of 
candidate  predictor  variables. 


V.  SUMMARY 


2 

The  empirical  model  builder,  in  utilizing  R  for  a  measure 

of  'goodness  of  fit',  needs  information  concerning  the  quality 

of  this  statistic.  This  paper  has  addressed  this  problem 

utilizing  optimization  techniques  to  help  the  model  builder 

2 

assess  the  amount  of  confidence  that  can  be  placed  in  a  R 
value  pertaining  to  a  particular  set  of  candidate  predictor 
variables . 

The  linear  programming  algorithms  presented  offer  a 

practical,  fast,  and  cost  effective  methodology  to  search 

the  hyperspace  in  which  the  linear  regression  model  will 

2 

operate.  The  lowest  value  of  R  (globally)  achievable  for  a 
particular  set  of  cost  data  can  be  found  when  the  number  (p) 
of  candidate  predictor  variables  is  small.  Whereas,  when 
the  number  of  variables  nears  fourteen  or  more,  a  local  mini¬ 
mum  must  be  accepted  due  to  computational  costs  inherent  in 
the  presented  methodology. 
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